15  NumPy: Working with Arrays

15.1 Introduction

NumPy is a fundamental library for numerical computing in Python, frequently used for scientific research, econometrics, and machine learning. By providing tools to work with large arrays of numeric data efficiently, it forms the computational backbone of the Python scientific ecosystem. These arrays—called NumPy arrays—allow you to store, manipulate, and compute on data more quickly and clearly than Python’s standard containers (lists, tuples, dictionaries).

In typical economics and data science workflows, we often handle large datasets (e.g., panel datasets of countries over many years, simulations with millions of draws, or large cross-sectional data). Operations on such data—like summations, regressions, matrix multiplications, or statistical aggregations—can be very costly if carried out using pure Python loops. NumPy arrays solve this problem by providing contiguous (uninterrupted) memory storage and vectorized operations that run at compiled speeds.

This chapter introduces the major building blocks of NumPy. We will start with a conceptual understanding of what a NumPy array is, how it differs from Python’s built-in structures, and why arrays are so important for efficient computing. We will then explore how to create arrays, reshape them, index and slice them, handle broadcasting and vectorized operations, and save/load data to permanent storage. Our aim is to give you enough understanding of NumPy’s conceptual underpinnings that you can confidently work with the library, setting a foundation for more advanced topics in econometrics and machine learning.

15.2 Importing NumPy

If you use Google Colab or similar platforms, NumPy is already installed. For a local installation, you can install NumPy with:

pip install numpy

By convention, NumPy is imported under the alias np:

import numpy as np

This alias is a nearly universal standard in the Python scientific computing world.

15.2.1 Submodules

NumPy groups some functionality into submodules. Two of them appear later in this chapter:

  • np.random: For random number generation (uniform, normal, etc.).
  • np.linalg: For linear algebra operations (e.g., matrix inverses, eigenvalues).

The remainder of this chapter will assume you have run import numpy as np.

15.3 Motivation and Conceptual Overview

15.3.1 Why NumPy Arrays?

Before diving into the syntax, consider why economists or data scientists use NumPy arrays instead of pure Python lists:

  1. Homogeneity and Speed
    A NumPy array can only hold one type of data, typically numeric (e.g., all floats, all integers). Because of this restriction, the array is laid out in memory contiguously. This organization is crucial for efficient numerical computing: it allows vectorized operations at the machine-code level without looping in Python.
    In contrast, Python lists can store mixed types and are more flexible, but that flexibility comes at a cost in terms of performance and memory overhead.

  2. Vectorized Operations
    NumPy allows you to perform mathematical operations over entire arrays with very concise syntax. For instance, if you wanted to add 10 to every element in a list of 1 million integers using pure Python, you would likely write a loop and do it one-by-one. In NumPy, this is a single, optimized operation (arr + 10) that is often orders of magnitude faster.

  3. Ease of Manipulation
    Economists commonly work with vectors (1D structures) and matrices (2D structures). NumPy extends easily to higher dimensions (3D, 4D, etc.), which is sometimes required for panel data or more involved data structures. Once you learn the fundamentals of NumPy slicing and indexing, reorganizing, filtering, or reshaping your data becomes straightforward.
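
As a small illustration of the vectorization point, the sketch below adds 10 to one million integers twice: once element by element in pure Python, and once with a single array expression. The names `values`, `looped`, and `vectorized` are illustrative only, and timings are left out because they depend on the machine.

```python
import numpy as np

n = 1_000_000

# Pure Python: one addition per element, executed by the interpreter
values = list(range(n))
looped = [v + 10 for v in values]

# NumPy: a single vectorized expression over the whole array
arr = np.arange(n)
vectorized = arr + 10

# Both approaches yield the same numbers
print(np.array_equal(vectorized, np.array(looped)))  # True
```

Wrapping each approach in a timer such as `time.perf_counter()` is a simple way to see the speed difference on your own machine.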

15.3.2 Differences from Python Lists

  • Memory Structure: NumPy arrays store data in a contiguous block. Python lists store references to objects scattered around memory.
  • Shape: A Python list is one-dimensional by default. You can nest lists to mimic multi-dimensional data, but it can become unwieldy. NumPy arrays can be created in 1D, 2D, or higher dimensions, and the library provides utilities to reshape and manipulate these dimensions consistently.
  • Fixed Size: A NumPy array’s size is determined upon creation. While you can still concatenate arrays, they are not as freely resizable as Python lists. This constraint helps maintain predictable memory usage and speed.

15.4 Creating and Inspecting Arrays

15.4.1 Array Creation

You create an array by passing a Python list (or list of lists) to np.array(). Here are some common patterns:

import numpy as np

# 1D array (vector)
vector = np.array([10, 20, 30, 40])

# 2D array (matrix)
matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])

# 3D array (cube-like structure)
cube = np.array([
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]]
])

Though this code may look similar to Python lists, the resulting objects, vector, matrix, and cube, are all NumPy arrays—complete with properties that we explore next.

15.4.2 Inspecting Array Attributes

NumPy arrays come with built-in attributes to help you quickly understand their layout and the data they hold:

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.shape)    # (2, 3) - tuple of dimensions
print(arr.ndim)     # 2      - number of dimensions
print(arr.size)     # 6      - total number of elements
print(arr.dtype)    # int64  - data type of elements (int32 on some platforms)

  • shape: The dimensions of the array (rows × columns in 2D).
  • ndim: How many dimensions the array has (2D, 3D, etc.).
  • size: Total number of elements in the array.
  • dtype: The type of each element (e.g., float64, int64).

15.4.2.1 Why Data Types (dtype) Matter

NumPy arrays are homogeneous, meaning all elements share the same type. In economics or finance, you often deal with floating-point data (e.g., real numbers), so your arrays will typically have float64 as their dtype. If you need to store only integers (for instance, integer-coded categories), an integer dtype such as int64 keeps the values exact. Note that int64 and float64 both use 8 bytes per element, so memory savings come only from choosing a narrower type such as int32 or int8.
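
To make the homogeneity rule concrete, the following sketch shows how NumPy picks a common type for mixed input and how an explicit dtype changes memory use. The arrays `mixed` and `codes` are made-up examples, not data from this chapter.

```python
import numpy as np

# Mixing integers and floats upcasts every element to a common type
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)        # float64

# Request a compact integer type explicitly
codes = np.array([0, 1, 2, 1], dtype=np.int8)
print(codes.nbytes)       # 4  (1 byte per element)

# The same values stored as float64 take 8 bytes per element
as_float = codes.astype(np.float64)
print(as_float.nbytes)    # 32
```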

15.5 Special Array Constructors

NumPy provides convenient functions for constructing arrays without having to manually type out lists:

  • Zeros and Ones
    Useful for initializing arrays of a given shape with all zeros or ones:

    zeros = np.zeros((3, 4))  # 3 rows, 4 columns of zeros
    ones  = np.ones((2, 2))   # 2 rows, 2 columns of ones
  • Empty
    Creates an uninitialized array. Its initial content is arbitrary, so it is usually used when you plan to fill the array later:

    empty = np.empty((2, 3))
  • Identity Matrix
    Commonly needed in linear algebra (e.g., an identity matrix in regressions):

    I = np.eye(3)  # 3x3 identity
  • Ranges
    NumPy’s version of Python’s range() is np.arange(), which generates a sequence of values:

    range_array = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]

These constructors help you allocate arrays for typical operations in economics and econometrics, such as setting up design matrices for regressions, placeholders for iterative algorithms, or identity matrices for transformations.
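
As one possible illustration of these uses, the sketch below builds a small design matrix with an intercept column and preallocates an array that a loop then fills. It relies on two functions not introduced above, `np.linspace` and `np.column_stack`, and the regressor `x` is hypothetical.

```python
import numpy as np

# Hypothetical regressor: five evenly spaced values between 0 and 1
x = np.linspace(0.0, 1.0, 5)      # 0, 0.25, 0.5, 0.75, 1

# Design matrix: an intercept column of ones followed by x
X = np.column_stack([np.ones(x.size), x])
print(X.shape)                    # (5, 2)

# Preallocate storage for the results of an iterative algorithm
results = np.zeros(3)
for i in range(3):
    results[i] = i ** 2           # fill in place
print(results)                    # [0. 1. 4.]
```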

15.6 Reshaping and Transposition

A crucial feature of NumPy arrays is their shape manipulation, which allows you to reorganize data easily.

arr = np.array([1, 2, 3, 4, 5, 6])
matrix = arr.reshape(2, 3)

Here, a one-dimensional array with 6 elements is reshaped into a 2×3 matrix. Internally, NumPy does not copy data if it’s not necessary; it simply changes how the data are viewed.
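
The short sketch below checks the "view, not copy" behavior for this simple case and shows two related details: passing -1 to let NumPy infer a dimension, and the error raised when the requested shape does not match the number of elements.

```python
import numpy as np

arr = np.arange(1, 7)            # [1, 2, 3, 4, 5, 6]

# -1 asks NumPy to infer that dimension from the total size
matrix = arr.reshape(2, -1)
print(matrix.shape)              # (2, 3)

# For a contiguous array like this one, reshape returns a view
print(np.shares_memory(arr, matrix))  # True

# Writing through the view therefore changes the original
matrix[0, 0] = 100
print(arr[0])                    # 100

# A shape with the wrong number of elements raises ValueError
try:
    arr.reshape(4, 2)
except ValueError as err:
    print("reshape failed:", err)
```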

15.6.0.1 Transposition

For a 2D array, you can transpose rows and columns:

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
transposed = matrix.T

Transposition is useful in linear algebra (e.g., to compute \(\mathbf{X}^\top \mathbf{X}\) in a regression).

15.6.0.2 Expand and Squeeze Dimensions

Sometimes you must expand or reduce dimensions to meet the needs of a function:

vector = np.array([1, 2, 3])             # shape (3,)
col_vec = np.expand_dims(vector, axis=1) # shape (3, 1)
row_vec = np.expand_dims(vector, axis=0) # shape (1, 3)

Adding or removing “axes” in your arrays is common when working with broadcasting or advanced data manipulation routines.
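
The example above only adds axes. A minimal sketch of the reverse operation, `np.squeeze`, follows, together with the `np.newaxis` indexing shorthand that many code bases use in place of `np.expand_dims`. The array `batch` is a made-up example.

```python
import numpy as np

col_vec = np.array([[1], [2], [3]])     # shape (3, 1)

# np.squeeze drops every axis of length 1
flat = np.squeeze(col_vec)
print(flat.shape)                       # (3,)

# Restrict the squeeze to one specific axis
batch = np.zeros((1, 3, 1))
print(np.squeeze(batch, axis=0).shape)  # (3, 1)

# np.newaxis inserts an axis through indexing
vector = np.array([1, 2, 3])
print(vector[:, np.newaxis].shape)      # (3, 1)
print(vector[np.newaxis, :].shape)      # (1, 3)
```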

15.7 Indexing and Slicing

15.7.1 Basic Indexing

Indexing in NumPy is similar to indexing in Python lists, but extended to multiple dimensions. If arr is 2D, arr[i, j] refers to the element in row \(i\) and column \(j\), with both indices counted from zero. For example:

matrix = np.array([
    [10, 20, 30],
    [40, 50, 60]
])

print(matrix[0, 1])  # 20
print(matrix[1, 2])  # 60

Indexing is essential for retrieving or modifying specific elements of your data.

15.7.2 Slicing

Slicing allows you to select subregions of the array without copying data. The syntax is [start:stop:step], where start is included, stop is excluded, and step defaults to 1.

  1. 1D slicing:

    arr = np.array([0, 1, 2, 3, 4, 5])
    print(arr[1:4])  # [1, 2, 3]
    print(arr[::2])  # [0, 2, 4]
  2. 2D slicing:

    matrix = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]
    ])
    submatrix = matrix[0:2, 1:3]  # rows 0 & 1, columns 1 & 2
    # result: [[2, 3],
    #          [5, 6]]

Slicing is a powerful tool for extracting or altering data subsets, e.g., selecting the first 100 observations, focusing on specific columns in a dataset, or partitioning time-series data into separate intervals.

15.7.2.1 Views vs. Copies in Slices

Crucially, basic slices return “views,” not copies, so altering a slice changes the original array. If you need an independent piece of data, copy it explicitly:

slice_view = matrix[0:2, :]
slice_copy = matrix[0:2, :].copy()
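
To see the difference in practice, the sketch below writes into both objects and then inspects the original matrix. It redefines the 3×3 `matrix` from the slicing example so that it runs on its own.

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

slice_view = matrix[0:2, :]          # view onto the first two rows
slice_copy = matrix[0:2, :].copy()   # independent copy

slice_view[0, 0] = -1                # writes through to matrix
slice_copy[0, 1] = -2                # leaves matrix untouched

print(matrix[0, 0])   # -1
print(matrix[0, 1])   # 2
print(np.shares_memory(matrix, slice_view))  # True
print(np.shares_memory(matrix, slice_copy))  # False
```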

15.7.3 Boolean Indexing

Boolean indexing allows you to filter elements based on some logical condition. For instance, if you have an array of returns and you want to extract only positive ones:

returns = np.array([-0.02, 0.01, 0.04, -0.01])
mask = (returns > 0)
pos_returns = returns[mask]  # array([0.01, 0.04])

This is analogous to “select where returns are positive,” an operation fundamental to data cleaning and outlier detection in empirical analysis.
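
Two follow-up patterns come up often in data cleaning and are sketched below using the same `returns` array: combining several conditions, and assigning through a mask. Unlike a basic slice, selecting with a mask produces a copy, which the middle step demonstrates.

```python
import numpy as np

returns = np.array([-0.02, 0.01, 0.04, -0.01])

# Combine conditions with & (and) and | (or); the parentheses are required
small_gain = returns[(returns > 0) & (returns < 0.03)]
print(small_gain)                 # [0.01]

# Selecting with a mask returns a copy, so the original is unchanged
pos = returns[returns > 0]
pos[0] = 99.0
print(returns[1])                 # 0.01

# Assigning through a mask modifies the target array in place
clipped = returns.copy()
clipped[clipped < 0] = 0.0        # negative returns replaced by 0.0
print(clipped)
```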

15.7.4 Fancy Indexing

Fancy indexing allows you to pull elements by specifying an array of integer indices:

arr = np.array([10, 20, 30, 40, 50])
indices = np.array([0, 3, 4])
selected = arr[indices]  # array([10, 40, 50])

This approach is useful when you already have identified the positions of special elements you want to extract—like matching certain time points or flagged observations.

15.8 Broadcasting

Broadcasting is one of NumPy’s most powerful (and initially non-intuitive) features, allowing you to perform arithmetic on arrays of different shapes.

15.8.1 Scalar Broadcasting

If you add a scalar to an array, NumPy “broadcasts” the scalar to match the array’s shape. That is, it treats the scalar as if it were an array of the same shape and type:

arr = np.array([1, 2, 3])
print(arr + 5)  # [6, 7, 8]

Conceptually, 5 became [5, 5, 5] under the hood.

15.8.2 Array Broadcasting

Two arrays with different shapes can still be compatible if one dimension can be repeated (“broadcast”) to match the other. For instance:

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])              # shape (2, 3)

vector = np.array([10, 20, 30])  # shape (3,)

result = matrix + vector
# shape (2, 3), done row-wise:
# [[11, 22, 33],
#  [14, 25, 36]]

In economic modeling or simulation contexts, broadcasting allows a concise expression of operations like adding a constant inflation term across all goods or all time periods, or applying a single coefficient vector to multiple data points in a matrix. The NumPy documentation describes broadcasting rules in more detail, but at a high level:

  1. NumPy compares the arrays dimension by dimension from right to left.
  2. They are considered compatible in a dimension if they are the same size, or if one has size 1 (which can be stretched).
  3. If the arrays differ in a dimension where neither is 1, broadcasting fails, resulting in an error.
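
The three rules can be checked on small arrays. The sketch below shows one case where a size-1 axis is stretched, one where both operands are stretched, and one that fails; the arrays are illustrative only.

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])                                   # shape (2, 3)

# (2, 3) + (2, 1): the size-1 axis is stretched across the columns
col = np.array([[100], [200]])       # shape (2, 1)
print(matrix + col)
# [[101 102 103]
#  [204 205 206]]

# (3, 1) + (1, 3): both size-1 axes are stretched, giving (3, 3)
outer = np.arange(3).reshape(3, 1) + np.arange(3).reshape(1, 3)
print(outer.shape)                   # (3, 3)

# (2, 3) + (2,): trailing sizes 3 and 2 differ and neither is 1
try:
    matrix + np.array([10, 20])
except ValueError as err:
    print("broadcast failed:", err)
```

When a length-2 vector is meant to apply per row, reshaping it to (2, 1) as in the first case is the usual remedy.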

15.9 Elementwise and Aggregation Functions

15.9.1 Elementwise Functions

NumPy provides vectorized mathematical functions that act on whole arrays at once:

arr = np.array([0, np.pi/2, np.pi])

print(np.sin(arr))  # [0.0, 1.0, 1.2246e-16]
print(np.exp(arr))
print(np.sqrt(arr))

No loops are needed, and computations are efficient. This design underpins many advanced data analysis libraries (like pandas, statsmodels, and more) that build upon NumPy.

15.9.2 Aggregation Functions

Aggregation combines elements of an array into a single value or a set of values:

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
print(np.sum(matrix))   # 21
print(np.mean(matrix))  # 3.5
print(np.max(matrix))   # 6

Additionally, you can specify an “axis” along which to aggregate:

  • axis=0: Aggregate down the columns.
  • axis=1: Aggregate across the rows.

print(np.sum(matrix, axis=0))  # [5, 7, 9]
print(np.sum(matrix, axis=1))  # [6, 15]

The interpretation depends on how the data are laid out. For example, if each row is a region and each column is a year, axis=0 sums over regions and returns one total per year, while axis=1 sums over years and returns one total per region.
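
Axis-wise aggregates are frequently combined with broadcasting. The sketch below demeans each column and then converts each row to shares of its row total; the `keepdims=True` argument keeps the reduced axis so that the shapes line up. The variable names are illustrative.

```python
import numpy as np

matrix = np.array([
    [1, 2, 3],
    [4, 5, 6]
], dtype=float)

# Column means: one value per column, shape (3,)
col_means = matrix.mean(axis=0)
print(col_means)                     # [2.5 3.5 4.5]

# Demean each column; (2, 3) - (3,) broadcasts row by row
demeaned = matrix - col_means
print(demeaned.mean(axis=0))         # [0. 0. 0.]

# keepdims=True keeps the reduced axis with length 1, shape (2, 1),
# so each row can be divided by its own sum
row_sums = matrix.sum(axis=1, keepdims=True)
shares = matrix / row_sums
print(shares.sum(axis=1))            # [1. 1.]
```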

15.10 Linear Algebra

NumPy supports matrix multiplication directly (through np.dot or the @ operator), and the np.linalg submodule adds standard routines such as inversion, solving linear systems, and eigenvalue decomposition:

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Matrix multiplication
c = np.dot(a, b)
# or equivalently
c_alt = a @ b

# Inverse
inv_a = np.linalg.inv(a)

# Solve Ax = b
b_vec = np.array([1, 2])
x = np.linalg.solve(a, b_vec)

These are cornerstones of econometric operations (e.g., OLS regressions often involve matrix multiplication and inversion) and provide a direct path to handle fundamental tasks in quantitative modeling.
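
As a hedged illustration of the OLS remark, the sketch below recovers the coefficients of a noise-free line by solving the normal equations \(\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}\). The data are invented for the example, and `np.linalg.solve` is used in place of an explicit inverse, which is a common choice for numerical stability rather than the only valid one.

```python
import numpy as np

# Hypothetical data generated exactly from y = 1 + 2x (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Normal equations (X'X) beta = X'y, solved without forming an inverse
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta, 6))              # [1. 2.]

# np.linalg.lstsq is a least-squares solver for the same problem
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta, beta_lstsq))  # True
```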

15.11 Comparisons and Logical Operations

NumPy supports elementwise and arraywise comparisons:

arr1 = np.array([1, 2, 3])
arr2 = np.array([2, 2, 1])

print(arr1 == arr2)  # [False, True, False]
print(np.array_equal(arr1, arr2))  # False

Such operations make it straightforward to identify matching records, detect missing or invalid data, and implement condition-based logic (e.g., “select all rows where GDP > 1000”).
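
A brief sketch of those uses follows, with a made-up `gdp` array that contains one missing value. It uses `np.isnan`, `np.where`, and `np.isclose`, which are not introduced above.

```python
import numpy as np

gdp = np.array([800.0, 1500.0, np.nan, 2300.0])  # hypothetical values

# Elementwise comparison yields a boolean mask; NaN compares as False
above = gdp > 1000
print(above)                         # [False  True False  True]

# np.isnan flags missing values
missing = np.isnan(gdp)
print(missing.sum())                 # 1

# any/all reduce a mask to a single boolean
print(above.any(), above.all())      # True False

# np.where chooses elementwise between two alternatives
filled = np.where(missing, 0.0, gdp)

# np.isclose compares floats with a tolerance instead of exact equality
print(np.isclose(0.1 + 0.2, 0.3))    # True
```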

15.12 Random Numbers

Random numbers are indispensable for Monte Carlo experiments, bootstrapping, or stochastic simulations in economic modeling. NumPy offers a dedicated submodule, np.random, with many useful functions:

np.random.seed(42)   # for reproducibility

# Uniform distribution
sample_uniform = np.random.rand(3, 3)

# Normal distribution
sample_normal = np.random.normal(0, 1, 1000)

# Integer random numbers
sample_ints = np.random.randint(0, 10, 5)

By specifying a seed, you ensure reproducible results, which is vital for research transparency and consistent results across runs.
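
NumPy's documentation recommends the newer Generator interface for new code, which keeps the random state in an object instead of a global seed. A minimal sketch of the same three draws with `np.random.default_rng` follows; the seed value 42 is arbitrary.

```python
import numpy as np

# A Generator seeded for reproducibility
rng = np.random.default_rng(42)

sample_uniform = rng.random((3, 3))          # uniform on [0, 1)
sample_normal = rng.normal(0.0, 1.0, 1000)   # mean 0, std. dev. 1
sample_ints = rng.integers(0, 10, 5)         # integers in [0, 10)

# A second generator with the same seed reproduces the same draws
rng_again = np.random.default_rng(42)
print(np.array_equal(sample_uniform, rng_again.random((3, 3))))  # True
```

Passing `rng` explicitly to functions that need randomness makes it clear which parts of a simulation share a random stream.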

15.13 Sorting and Searching

Sorting is important when you need data ordered by time, magnitude, or any other metric:

arr = np.array([3, 1, 4, 1, 5, 9])
sorted_arr = np.sort(arr)  # returns a sorted copy

You can also sort along axes in 2D arrays:

matrix = np.array([[3, 1, 4], [1, 5, 9]])
print(np.sort(matrix, axis=0))  # Sort each column
print(np.sort(matrix, axis=1))  # Sort each row

Searching utilities, such as np.argsort (to get the indices that would sort an array) or np.searchsorted (binary search in a sorted array), are especially relevant for time-series alignment or indexing events chronologically.
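
A minimal sketch of both utilities is shown below. The `labels` and `years` arrays are invented for illustration, and a stable sort is requested so that tied values keep their original order.

```python
import numpy as np

arr = np.array([3, 1, 4, 1, 5, 9])

# Indices that would sort the array; a stable sort keeps tie order
order = np.argsort(arr, kind="stable")
print(order)                 # [1 3 0 2 4 5]
print(arr[order])            # [1 1 3 4 5 9]

# Reorder a second array by the same ordering
labels = np.array(["c", "a", "d", "b", "e", "f"])
print(labels[order])         # ['a' 'b' 'c' 'd' 'e' 'f']

# Binary search: insertion index that keeps a sorted array sorted
years = np.array([1990, 1995, 2000, 2005])
print(np.searchsorted(years, 1998))   # 2
print(np.searchsorted(years, 2000))   # 2 (side='left' is the default)
```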

15.14 Saving and Loading Data

Because economic and machine learning analyses often involve large datasets or repeated computations, saving and loading arrays in an efficient format is critical:

data_2d = np.array([[1, 2, 3],
                    [4, 5, 6]])
np.savetxt('array_2d.csv', data_2d, delimiter=',')

# Reload from CSV (values are read back as float64 by default)
loaded_data_2d = np.loadtxt('array_2d.csv', delimiter=',')

For higher-dimensional data, CSV may not be straightforward (you need to reshape). NumPy’s binary format (.npy) handles any dimension without hassle:

# Save a 3D array to .npy
data_3d = np.arange(24).reshape(2, 3, 4)
np.save('array_3d.npy', data_3d)

# Load
restored = np.load('array_3d.npy')

Using .npy files, or .npz archives that bundle several arrays into one file (optionally compressed), is faster for large data, preserves data types and shapes, and avoids CSV’s limitations.
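
A short sketch of the .npz route follows. It stores two arrays under the names `X` and `y` with `np.savez_compressed` and reads them back. The file name `dataset.npz` is hypothetical, and a temporary directory is used so that the example leaves no files behind.

```python
import os
import tempfile

import numpy as np

X = np.arange(6, dtype=np.int32).reshape(2, 3)
y = np.array([0.5, 1.5])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.npz")   # hypothetical file name

    # Store several named arrays in one compressed archive
    np.savez_compressed(path, X=X, y=y)

    # np.load returns a dict-like object keyed by those names
    with np.load(path) as archive:
        X_loaded = archive["X"]
        y_loaded = archive["y"]

print(X_loaded.dtype, X_loaded.shape)   # int32 (2, 3)
print(np.array_equal(y, y_loaded))      # True
```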

15.15 Concluding Remarks

NumPy arrays are at the core of fast, efficient data management in Python. For economists who are accustomed to thinking in terms of vectors and matrices, NumPy provides a natural programming interface that extends seamlessly to higher dimensions when needed. Key points to remember:

  1. Concept of Contiguous Data
    NumPy arrays store data in contiguous memory blocks, enabling efficient computation and vectorized operations.

  2. Shape Manipulation and Indexing
    You can reshape arrays, slice them, and index them in a variety of ways to work with exactly the portion of your data you need.

  3. Broadcasting
    This feature facilitates concise and powerful arithmetic on arrays of differing shapes, which is particularly handy in model building and simulation.

  4. Linear Algebra Integration
    NumPy provides well-optimized routines for matrix operations, eigenvalues, and other standard linear algebra procedures, ubiquitous in econometrics.

  5. Data Persistence
    Arrays can be saved and loaded quickly, making repeated analyses or large simulations manageable.

With a firm grasp of these fundamentals, you will find it easier to adopt more advanced data tools, such as pandas for data frames, statsmodels for econometric analysis, or scikit-learn for machine learning. As you proceed, keep these principles in mind: NumPy’s efficiency is often the foundation upon which entire analytic pipelines are built. By leveraging arrays effectively, you will set yourself up for success in handling large datasets, running complex models, and ultimately deriving more insights from your data.